Saturating Splines and Feature Selection

We extend the adaptive regression spline model by incorporating saturation, the natural requirement that a function extend as a constant outside a certain range. We fit saturating splines to data via a convex optimization problem over a space of measures, which we solve using an efficient algorithm based on the conditional gradient method. Unlike many existing approaches, our algorithm solves the original infinite-dimensional (for splines of degree at least two) optimization problem without pre-specified knot locations. We then adapt our algorithm to fit generalized additive models with saturating splines as coordinate functions and show that the saturation requirement allows our model to simultaneously perform feature selection and nonlinear function fitting. Finally, we briefly sketch how the method can be extended to higher order splines and to different requirements on the extension outside the data range.


Introduction
Splines -piecewise polynomials with continuity constraints -are widely used to fit data [1, §5.1].One issue with piecewise polynomials is that they behave erratically beyond their boundary knot points, and (typically) grow without bound outside of that range [1, §5.2].This instability makes extrapolation dangerous; practitioners must take care to avoid querying spline models near or outside of the range of the training data.
Smoothing spline algorithms [2][3][4] ameliorate this problem by fitting natural splines, which reduce to a lower-degree polynomial beyond the boundary knots.The most commonly used varieties of smoothing splines are cubic smoothing splines (degree-three splines that reduce to linear outside the boundary knots) and linear smoothing splines, which extend as constant.The saturating splines we propose are closely related to linear smoothing splines.
Smoothing splines use an 2 or quadratic notion of complexity, and hence fit models with a predetermined and dense set of knot points [1, §5.4].Adaptive regression splines [5], on the other hand, use an 1 -type penalty, which can result in a sparse set of adaptively chosen knots.However, adaptive regression splines do not reduce to lower degree outside of the range of their largest knots, and hence may suffer issues of instability.
We propose fitting adaptive regression splines with explicit constraints on the degree of the spline outside of a certain interval.We call such splines saturating splines.While the approach we take can be extended to fitting splines of arbitrary degree with constraints on arbitrary derivatives, in this paper we focus on fitting linear splines that are flat (constant) outside the data range; we mention the extension to higher degree splines in §8.We show that saturating splines inherit the knot-selection property of adaptive regression splines, while at the same time behave like natural splines near the boundaries of the data.
We also show a very important benefit in the context of fitting generalized additive models [6] with saturating spline coordinate functions: the saturation constraint naturally results in variable selection.Standard adaptive spline (and most other types of spline) generalized additive models do not penalize linear coordinate functions, which means they (almost always) have nonzero responses to every feature.This lack of feature selection can hurt interpretability and, to a certain extent, generalization.The saturation constraint we 1 arXiv:1609.06764v2[stat.ML] 27 Sep 2016 propose precludes linear functions, and in concert with the adaptive spline 1 penalty encourages coordinate functions to be identically zero.As a result, generalized additive models fit with saturating spline component functions often depend on only a few input features.
Like smoothing splines and adaptive regression splines, saturating splines arise as solutions to certain natural functional regression problems.We solve the saturating spline fitting problem by reformulating it as a convex optimization problem over a space of measures: roughly speaking, the second derivative of the fitted function.To the best of our knowledge, this approach is novel.We then apply a variant of the classical conditional gradient method [7,8] to this problem.At each iteration of our algorithm, an atomic measure is produced; moreover, we can uniformly bound the number of atoms, which corresponds to the number of knot points in the spline function.(While we manipulate atomic measures, we solve the problem over the space of all measures with finite total variation.)In contrast to standard coordinate descent methods, in each iteration of the conditional gradient method the weights of two knot points are adjusted.In the fully corrective step, we solve a finite-dimensional convex optimization problem with 1 and simple linear constraints.Numerical experiments show that the method is extremely effective in practice.
Our optimization method can exploit warm starts, i.e., it can use an initial guess for the fitted function.This allows us to compute an entire regularization path efficiently, at a cost typically just a small multiple of the effort to solve the problem for one value of the regularization parameter.Because our algorithm is based on the conditional gradient method, we can use the framework of [9] to compute a provablysuboptimal approximate regularization path.When fitting generalized additive models, the regularization path has attractive features: at critical values of the regularization parameter, new regressors are brought into (or, occasionally, out of) the model, or new knot points are added to (or deleted from) one of the existing coordinate functions.Thus our approach combines feature selection and knot point selection.

Outline
In §2 we introduce a univariate function fitting problem, inspired by the adaptive spline estimation problem of [5], that includes the additional requirement that the fitted function saturate.In §3 we make the connection between our function estimation problem and standard adaptive splines, and pose the saturating spline fitting problem as a convex optimization problem over measures.In §4 we modify the classical conditional gradient method to solve this optimization problem.In §5 we extend the optimization problem and algorithm to fit generalized additive models to multivariate data.We illustrate the effectiveness of the method with several examples in §7.We discuss generalizations to higher-degree splines in §8.Finally, we discuss potential extensions and variations in §9.The appendix includes implementation details and proofs.

Univariate function fitting
We wish to fit a continuous bounded function To do this we will choose f to minimize a data mismatch or loss function subject to a constraint that encourages regularity in f , and an additional constraint, saturation, that we describe below.
The loss is given by where : R × Y → R is nonnegative, twice differentiable, and strictly convex in its first argument.Typical loss functions include (z, w) The loss L is a convex functional of the function f that only depends on the values of f at the data points x i .The smaller the loss, the better f fits the given data.
We constrain the function f to be simple by limiting the value of a nonnegative regularization functional R. In this paper, we take R to be the total variation of the derivative of f , i.e., the regularization is the 1 norm of the second derivative.(As we review in the following section, the modern definition of total variation extends this equality to nondifferentiable functions.)The total variation limit we impose on f is R(f ) ≤ τ , where τ is a parameter that we use to trade off model fit and model regularity.This regularization constraint implicitly constrains f to be differentiable almost everywhere, with its derivative having finite total variation.Our model f will be subject to one more constraint, that it saturates (outside the interval [0, 1]), which means that it is a (possibly different) constant on the two intervals outside [0, 1]: f (x) = f (0) for x ≤ 0, and f (x) = f (1) for x ≥ 1.In other words, f extends as a constant outside the nominal data range of [0, 1].In terms of the derivative, this is equivalent to the requirement that f exists and is zero outside [0, 1].
The fitting problem is then where τ ≥ 0 is the regularization parameter.The variable to be determined is the function f , which is in the vector space of continuous functions with derivatives of finite total variation.This fitting problem is an infinite-dimensional convex optimization problem.
In applications the problem ( 2) is solved for a range of values of τ , which yields the regularization path.The final model is selected using a hold-out set or cross-validation.For τ = 0, f must be constant and the problem (2) reduces to fitting the best constant to the data.As τ increases, f is less constrained, and our fitted model becomes more complex; eventually we expect overfitting.For example, in the case of regression, with a loss function that satisfies (u, u) = 0 and data with distinct x i , the fitting function is the piecewise-linear function that interpolates the data, for large enough τ .

Splines and functions of bounded variation
In this section we explore the connection between our fitting problem and degree-one splines, i.e., piecewiselinear continuous functions, which have the form where (z) + = max{z, 0}.We assume that the t i are distinct, and refer to them as knot points or simply knots.The scalars w i are the weights, and c is the offset.We refer to the function x → (x − t i ) + as a hinge function, so a degree-one spline is a finite linear combination of hinge functions, plus a constant.

Functions of bounded variation
A right-continuous function h : [0, 1] → R is of bounded variation if and only if there exists a signed measure µ on [0, 1] with where 1(y ≤ z) = 1 for y ≤ z and 0 otherwise.The measure µ is unique; we can think of it as the derivative of h.That is, (4) is essentially the second fundamental theorem of calculus with h replaced by µ.
We also have TV(h) = d|µ|.(This is called the total variation of the measure µ.)We will denote this using the notation µ 1 , to emphasize the similarity with the finite-dimensional case, or the case when h is differentiable: TV(h) = h 1 .When the measure µ is atomic, the function h is piecewise constant with jumps at the points in the support of µ.

Splines and derivatives with bounded variation
Now suppose that f : [0, 1] → R has a right-continuous derivative of bounded variation.From (4), with h = f , and the fundamental theorem of calculus, we have = f (0) + (x − y) + dµ(y).
This shows that any such function is a (possibly infinite) linear combination of hinge functions, plus a constant (i.e., f (0)).In this case, the measure µ can be thought of as the second derivative of f .When µ is atomic and supported on a finite set, that is, f is a degree-one spline of the form (3), with c = f (0).So degree-one splines correspond exactly to the case where the measure µ (roughly, the second derivative) has finite support.We introduce the notation to denote the function derived from the measure µ.It is, roughly speaking, the double integral of the measure µ, or the (potentially infinite) linear combination of hinge functions associated with the measure µ.The mapping from µ to f µ is linear, and we have TV(f µ ) = µ 1 .A simple example of f µ , its first derivative f µ , and its (atomic measure) second derivative µ is shown in Figure 1.The regularization functional, TV(f µ ), is the sum of the absolute values of the spikes in µ.Note that the (signed) sum of the spikes in µ is zero: that is, dµ = 0, which implies that f µ saturates.

Fitting splines by optimizing over measures
Identifying f = c + f µ , we can solve the fitting problem (2) by minimizing over the bounded measure µ on [0, 1], and the constant c.The measure µ is the second derivative of f , and the constant c corresponds to f (0).The total variation regularization constraint TV(f ) ≤ τ corresponds to µ 1 ≤ τ .The saturation condition holds by construction for x < 0; to ensure that f (x) = 0 for x > 1, we need In other words, saturation of f corresponds to µ having total (net) mass zero.Thus (2) can be rephrased as minimize over the bounded measure µ on [0, 1], and c ∈ R1 .In the above, E x is the linear operator that maps µ to the vector (f µ (x 1 ), . . ., f µ (x n )), given by ( 8).E x is clearly linear, as it is the integral of the function against µ.We will apply the conditional gradient method directly to this problem.
To gain intuition about the optimization problem (9), we can consider it as a infinite-dimensional analogue of the standard lasso [10].The lasso is the solution to the optimization problem minimize Here w is a vector in R d , and A ∈ R (n,d) is a matrix.Ignoring the constant term c, we see that ( 9) looks very similar to (10), where E x plays the role of A; indeed, E x is essentially a matrix with n rows and infinitely many columns.Our intuition from the lasso suggests that there should be solutions of (9) that are sparse, which here means that µ is atomic.In terms of f µ , sparsity means there are solutions of the original functional fitting problem (2) that are degree-one splines.This is indeed the case.Theorem 1 shows that there is a solution of (9) with µ atomic, supported on no more than n + 2 points; in other words, f µ is a degree-one spline with K ≤ n + 2.Moreover, in practice the solution of (9) will exhibit selection, that is, it will be supported on far fewer than n + 2 points.
Theorem 1. Fix x 1 , . . ., x n ∈ [0, 1] and f : R → R with f (right-continuous) of bounded total variation, and f constant outside of [0, 1].Then there exists a degree-one saturating spline f (with an most n + 2 knots) that matches f on x i with TV( f ) ≤ TV(f ).
For the remainder of the paper we will ignore the constant term c.It is not difficult to adapt the algorithms we present to handle the constant term, but doing so does add some notational complexity.It's also possible to minimize out c, as it does not affect the regularization term; the resulting problem is still convex in w.

The conditional gradient method for fitting splines
In this section we outline our algorithm for solving (9) (and therefore also (2)).To that end, we briefly review the classical conditional gradient method [7] and the measure-theoretic version proposed in [8].
The optimization problem we need to solve, ( 9), (without the constant term c) As noted in the last section, ( 11) is a convex optimization problem over a space of measures.We closely follow the approach taken in [8] and apply the conditional gradient method to this problem directly.
The main benefit of this approach is that we can restrict our attention to atomic measures, i.e., µ of the form Measures of this form are easily representable in a computer, by simply storing a list of (w j , t j ) pairs.Theorem 1 ensures that the number of knots we need to store is absolutely bounded, i.e., that our algorithm runs in bounded memory.While we manipulate atomic measures, we solve the problem ( 11) over all bounded measures.
One thing to note about finitely-supported atomic measures is that we can easily optimize over the weights w j with the knot locations t j fixed, since this corresponds to a finite-dimensional convex optimization problem amenable to any standard algorithm.Our algorithm makes use of this fact, and alternates between adding pairs of knots and optimizing over the weights w at each iteration.In this latter step knots can be (and indeed eventually must be) removed.In an additional and optional step the knot points can be moved continuously within [0, 1], or to neighboring data points.This step is not needed for theoretical convergence but can improve convergence and the sparsity of the final solution in practice.

The conditional gradient method
The conditional gradient method (CGM) solves constrained convex optimization problems of the form with variable x ∈ R d .In the above, it is always assumed that the (convex) function f is differentiable.At each iteration of the CGM we form the standard linear approximation to the function f at the current iterate Here f (d; x) is the directional derivative of the function f at x in the direction d, defined by Our use of the directional derivative here may seem surprising: for differentiable functions on R d , f (d; x) is always equal to ∇f (x), d .The direct applicability of directional derivatives to convex functions of measures motivates us to prefer the directional derivative.Convexity of f implies that f is a lower bound on f , that is: In the next step of the CGM, we minimize this first-order approximation over the feasible set C: An illustration of a single iteration of the conditional gradient method on the function f (x) = x 2 at the point 1  2 .The set C is the interval [−0.25, 1.25], indicated by the solid vertical lines.The first order approximation f (•; 1  2 ) is plotted as the dotted line tangential to f (x) at 1 2 .The conditional gradient s t is the point −0.25.
The point s t is called the conditional gradient of f .Note that s t provides a lower bound on f (x ): In particular, we can bound the sub-optimality of the point x t : One can show (as in [7]) that this bound decreases to zero, which means that it can be used as a (nonheuristic) termination criterion.After determining s t , there are several options for updating x t .In this paper, we will use the fully-corrective variant of the CGM, which chooses x t+1 to minimize f over the convex hull of {s 1 , s 2 , . . ., s t }.

Conditional gradient for measures
In this subsection, we apply the conditional gradient method to the infinite-dimensional problem (11), which we repeat here: minimize First, we'll show that the conditional gradient, i.e., the measure s t , can be chosen to be supported on exactly two points, and is computable in time linear in n.The directional derivative of the objective function in the direction of the measure s at the point µ is given by We can then interchange the inner-product in ∇L(E x µ), E x s with the integral in E x s = ψ(t) ds(t): , g is simply the residual E x µ − y and g, ψ(t) is the correlation between the residual and a single hinge function located at t.A conditional gradient is any solution to the following optimization problem minimize g, ψ(t) ds(t) subject to ds = 0, Without the integral constraint, we would expect there to be a solution to (16) that is a single point-mass: the objective function is the integral of a scalar-valued function against a bounded measure.We'll show that there is always a solution to ( 16) that is supported on exactly two points.Furthermore, we'll show that those two points can be computed in time linear in n.
Let s be any solution to (16).Let S + be the support of the positive part of s , and let S − be its complement.From ds = 0 and s 1 = τ we have We can move all of the positive mass of s to the single point t + given by and all of the negative mass to the point t − given by without increasing the objective value of ( 16).This is clear: g, ψ(t) ≥ g, ψ(t + ) for all t.An analogous argument holds for t − .Thus, we can always take s t as follows: Note that finding t − and t + involves two separate optimization problems over [0, 1] instead of one over [0, 1] × [0, 1].These problems are readily solved by gridding, though in this case they can be solved exactly in time linear in n if we have access to a sorted vector of the data points x i .To see this, we expand the objective function for t + above, If x i are sorted, we can compute the minimizer between each pair of consecutive data points exactly, since this is simply computing the minimizer of a linear functional over an interval.Thus in a single pass over the data we can compute the global minimizer exactly.
Immediately after computing t − and t + we can use (4.1) to bound the suboptimality of µ.
With this choice of conditional gradient, the fully-corrective step is a finite-dimensional convex problem.Fixing the knot locations encountered as conditional gradients so far, t 1 , . . ., t 2k , we can do at least as well as the fully-corrective algorithm by solving the following optimization problem: This is equivalent to the following optimization problem in R 2k : minimize L( j w j E x δ tj ) subject to 1 T w = 0, w 1 ≤ τ.
We can solve this using any of a number of existing algorithms [11,12].In our implementation we use the conditional gradient method with line-search for simplicity.By warm starting with an increasing sequence of τ 's, we can efficiently compute an approximate regularization path.Indeed we can even provide a provably -suboptimal path using the approach of [9].

Convergence
As in the case of ADCG [8] convergence follows immediately from the conditional gradient method proof in general Banach spaces [13,14].
Another method is to transform the iterates µ t into v t ∈ R n by the (linear) mapping We then note that v t correspond exactly to iterates generated by the standard conditional gradient method applied to a certain atomic norm [8,15] problem.

Generalized additive models
One natural application of univariate splines is fitting generalized additive models [6] to multivariate data: That is, fitting a function of the form where each f d is a simple function from R to R (here x d is the dth coordinate of the vector x).We can mimic our approach in the scalar case with the following optimization problem: Here R is the same regularizer used in the scalar case, namely As in the scalar case, one can show that there is always an optimal f with each coordinate function f d a degree-one saturating spline.This allows us to rephrase (19) as an optimization problem over measures.The only change from the scalar case is that the measure is over the set {1, . . ., D} × [0, 1] -each knot is now attached to a particular coordinate.In other words, we search for a function of the following form: We again have equality between the 1 norm of µ and the regularization term: The analogue of ( 11) is then minimize The conditional gradient algorithm from the scalar case generalizes immediately to fitting generalized additive models -the only difference is that we now need to find a pair of knots for the same coordinate.This involves solving d pairs of nonconvex optimization problems over [0, 1] -again this can be done by gridding or by sorting the training data.Saturating splines gain an additional advantage over standard adaptive splines in the generalized additive model case.The addition of the saturation constraint (that f d be constant outside of [0, 1]) naturally leads to variable selection when fitting generalized additive models.What we mean by variable selection is that the functions f d are often exactly 0. This is very different from the standard adaptive spline setup without the saturation constraint.In that case, linear functions, i.e. f d (x d ) = wx d completely escape the regularization, and as a result are essentially always included in the model.Linear functions are not free with saturation constraints (in fact, outside of the function 0, they are not possible).When we solve (20) we simultaneously fit nonlinear coordinate functions while doing variable selection.

Prior and related work
Smoothing splines also have an interpretation as the solution of an infinite-dimensional optimization problem [1, §5.4].In fact, (degree-one) smoothing splines solve where R(f ) = f (x) 2 dx.
The solution to ( 21) is also a degree-one natural spline that saturates outside of [0, 1].However, the solutions to ( 21) and ( 2) are very different.Roughly, ( 21) is analogous to ridge regression, while (2) is analogous to the lasso.That is, (21) fits functions with as many knots as datapoints, while (2) often fits splines with very few knots.
Another type of spline, that is adaptive but does not saturate, are adaptive regression splines [5].These splines also arise as solutions to a functional regression problem: where Note that this is (2) without the saturation constraint.Algorithms for solving (22) (for degree-one splines) are based on an extension of Theorem 1, that shows there is a solution to (22) which is actually supported on the data points x i .Hence a lasso algorithm can be used to find the solution.This also suggests a very simple method to solve our problem (9): we fix the n knot points to be the values of the data x i , and solve the finite-dimensional convex optimization problem to find the weights.While simple coordinate-descent methods like GLMNet [16] will not immediately work because of the saturation constraint, they could be modified to handle the constraint.We outline how this may be done in Appendix B. This method does work, but can be much slower than ours since in practice the number of knots is typically much smaller than n for useful values of the regularization parameter τ , and the finite-dimensional problem with n basis functions is very poorly conditioned.With that said, the algorithm we propose -for the piecewise linear case -can be interpreted as a forward active set method for the finite dimensional problem, where we avoid explicitly evaluating all basis functions.One advantage of our measure-theoretic approach is that it immediately generalizes to higher-degree splines, where the support of µ need not be on data points, as we will see in §9.In this case ( 9) is truly infinite-dimensional, yet our algorithm can still be directly applied.
Trend filtering is a nonparamteric function estimation technique, first introduced in [17], that is very similar to adaptive splines.Indeed, as discussed in [18], the trend filtering estimate in the constant or piecewise-linear case is exactly the same as the adaptive spline estimate.Trend filtering is increasingly popular as it admits extremely efficient, robust algorithms [18,19].Indeed, some of these algorithms (especially those adapted to fit GAMs [20]) may be adapted to efficiently fit saturating trend filter estimates, which would benefit from the feature selection properties of saturating splines and the computational efficiency of trend filtering.
There are a number of methods for fitting generalized additive models with spline component functions.One approach (taken in [21]) is to use the group-lasso version of (6): Extending this idea, [22] use an overlap group-lasso that facilitates selection between zero, linear and nonlinear terms.The differences between these approaches and ours are analogous to the differences between the standard group-lasso and the lasso.While both do feature selection, the penalty functional ( 6) does not do knot-selection within each coordinate function.
One very similar approach to fitting splines that does not require knot selection (but does not incorporate saturation) is discussed in [23].

Examples
In all examples we affinely preprocess the data so that all training features lie in [0, 1], and apply the same transformation to the test features (which thus may have values outside of [0, 1]).All plots are in terms of the standardized features.

Bone density
We start with a simple univariate dataset from [1, §5.4].The response variable for this dataset is the change in spinal bone density between two doctor visits for female adolescents as a function of age.There are 259 data points, of which we hold out 120 for validation, leaving 139 data points to which we fit a saturating spine.We start with the square loss.
The results are shown in figure 3, for three values of the regularization parameter τ .The scattered points are the training data, the solid line is the saturating spline fit by our algorithm.The figure demonstrates the clear link between τ and the complexity of the optimized spline.Out-of-sample validation suggests setting τ 3.34, which achieves a validation RMSE of 0.036.
To demonstrate that our proposed method works with more general loss functions, we add 30 simulated outliers to the training set and fit with the pseudo-Huber loss [24], a smooth approximation to the Huber loss function given by where δ > 0 is a parameter that interpolates between the absolute value loss and the squared loss.For our experiment we take δ = 0.0015; roughly speaking, the transition between square and linear loss occurs around √ δ = 0.039.The results are shown in figure 4.These plots demonstrate that our algorithm can fit losses other than the square loss, and confirms that the pseudo-Huber loss is far more robust to outliers than the basic square loss function.Indeed, on the validation set the least-squares fit achieves a minimum RMSE of 0.096, while the pseudo-Huber fit achieves 0.038, only slightly worse than the fit obtained before the outliers were added to the training data.While this one-dimensional problem is very easy, it shows the advantage of the adaptive spline penalty: the optimal model has only 5 knot points.

Abalone
We fit a generalized additive model with saturating spline coordinate functions to the Abalone dataset from the UCI Machine Learning Repository [25].The data consists of 4177 observations of 8 features of abalone along with the target variable, the age of the abalone.We hold out 400 data points as a validation set, leaving 3777 data points to fit the model.The first feature (labeled sex) has three values: Male, Female, and Juvenile, which are coded with values 0, 1, 2; the other 7 are (directly) real numbers.The task is to estimate the age of the abalone from the features.Validation suggests we choose τ 200, which achieves a test set RMSE of 2.131.Because the number of features is low, we can plot the entire generalized additive model.Each plot shows one coordinate function f d for d = 1, . . ., 8 as a function of the standardized feature in [0, 1].The coordinate functions are shown for three values of τ , with the middle one corresponding to the value that minimizes test RMSE.When a coordinate function is zero, which means that the feature is not used in the model, it is shown in blue.We can see that in the case of strong regularization (τ = 20), several coordinates are not used; for the best model (τ = 200), all features are used, with a few having only a small effect.It is interesting to see how the sex factors into the optimal model.It is neutral on Male or Female, but subtracts a small fixed amount from its age prediction for a Juvenile abalone.
This dataset is small enough that we can compare against standard adaptive splines fit using a coarse grid of [0, 1].For this experiment, we fit a GAM with standard adaptive spline component functions using GLMNET [16].The standard adaptive GAM fit, which does no variable selection, achieves a validation set RMSE of 2.137, not significantly worse than the saturating spline model.Our algorithm, however, uses many fewer knot points, perhaps due to the poor conditioning of the gridded problem.

Spam
We consider the problem of classifying email into spam/not spam, with a dataset taken from ESL [1].
The dataset consists of 57 word-frequency features from 4601 email messages, along with their labels as spam or not spam.Following the approach in ESL [1] we log-transform the features and use the standard train/validation split, with a training set of size 3065, and test set with 1536 samples.We fit a saturating spline generalized additive model with standard logistic loss.Figure 6 shows the validation error versus the regularization parameter τ , which suggests the choice τ = 500.To show the benefit of nonlinear coordinate functions, we also include the best validation error achieved using a linear model (fit using GLMNet [16]).
With regularization parameter τ = 500, the model selects 55 of the 57 features.We note that our saturating spline generalized additive model modestly outperforms many methods from ESL [1]; for example, smoothing splines yield 5.3% error, while our model has an error rate well below 5%. Figure 7 shows (some of) the coordinate functions for the model with τ = 500.The coordinate functions use very few knots, making them readily interpretable.
For comparison, we fit a GAM with standard adaptive spline coordinate functions.To do so, we grid each dimension with 20 knots and solve the resulting finite-dimensional problem with GLMNET [16].Note that adaptive splines do not penalize linear functions, so there is no feature selection.Adaptive splines achieve a minimum error of 4.8%, significantly worse than saturating splines.

ALS
Using this dataset we try to predict the rate of progression of ALS (amyotrophic lateral sclerosis) in medical patients, as measured by the rate of change in their functional rating score, a measurement of functional impairment.The dataset is split into a training set of 1197 examples and a validation set of 625 additional patients.Each datapoint has dimension 369.We fit a generalized additive model with saturating spline component functions to the data using a least-squares objective function.Following [26, §17.2], we measure performance using mean-squared error.
Figure 8 shows the validation error versus the regularization parameter τ , which suggests the choice τ = 13.On the same plot, we also show the results from [26] using boosted regression trees and random forests.The optimal saturating spline GAM model selects only 50 out of the 369 features, in contrast to boosted regression trees, which use 267.The saturating spline GAM model performs comparably to boosted regression trees and random forests, but uses substantially fewer features, improving interpretability.
Again we fit a GAM with standard adaptive spline coordinate functions (using GLMNET) to show the advantage of saturation.The standard adaptive spline fit achieves an MSE of 0.547, substantially worse than any other model.We speculate that this is because the unpenalized linear functions lead to immediate overfitting.Indeed, removing the unpenalized linear functions and fitting a model with only hinges gives very similar performance to the saturating spline fit, suggesting that the main advantage of saturation for this application is the removal of the unpenalized linear functions.

Higher-degree splines
In the majority of this paper we focused on the functional regression problem (2), with a total variation constraint on the first derivative and a saturation constraint on the zeroth derivative (the function itself).In this section, we consider constraints on higher order derivatives, which lead to solutions that are splines of higher degrees.
We consider the family of nonparametric function estimation problems indexed by 0 ≤ j ≤ k.This is the analogue of the functional regression problem (2) with a total variation constraint on the k-th derivative and a saturation constraint on the (k − j)-th derivative.The saturating spline case from the rest the paper is the special case of ( 23) with k = 1, j = 0. Widely used cubic natural splines correspond to k = 3, j = 1.Note that unlike natural splines, which are only defined for some values of j and k, there are no constraints on j and k.
We now show that higher-degree saturating splines solve (23) in general.As f (k) is of bounded TV, there exists a measure µ s.t.f (k) (x) = 1(t ≤ x) dµ(t).Then we have for some w l .In the above, all iterated integrals take place j times.Note that the constraint that f (k−j) (x) = 0 for all x < 0 implies that the polynomial term, j−1 l=0 w l x l is identically zero.So, we have For x > 1, we can remove the nonlinearity, that is, for x > 1, f (k−j) (x) is simply the integral of a polynomial in x.We can pull terms involving x out of the integral to get a polynomial in x whose coefficients are nonzero multiples of the first j moments of µ: Again, we note that as this polynomial is identically zero for infinitely many points, all of the coefficients must be zero.In terms of the measure µ, this means: t l dµ(t) = 0 for l = 0, . . ., j.
This shows that the constraint that the (k − j)-th derivative of f saturate translates to constraints on all moments of µ up to the j-th moment.
While the conditional gradient step becomes more complex with the addition of more moment constraints, the approach taken in this paper can still be applied to (23) as long as j is fairly small -the conditional gradient step for (23) involves a nonconvex optimization problem over [0, 1] j+2 .This is because we need at least (j + 2) point-masses to satisfy the moment constraints.So, fitting quadratic splines that saturate to linear is very easy -in fact the code to do so is essentially identical to that for fitting piecewise linear saturating splines splines -but fitting quadratic splines that saturate to constant is slightly more difficult due to the additional linear constraint on the measure µ.

Variations, extensions, and conclusion
While saturation is often a natural prior, the approach we take in this paper can also be applied to other (convex) variations on (9).For example, we could add the constraint that the fitted function is monotone nondecreasing, or takes values in a given interval.
A simple algorithmic extension would be to incorporate nonconvex optimization in the spirit of [8].At each iteration we adjust the weights of the atomic measure (w), but we could also adjust the knot locations (t).The objective in (18) is nonconvex in t i , but we can still attempt to find a local minimum.As long as we do not increase the objective function the algorithm is still guaranteed to converge [8].In the case of degree one splines, we can use the fact that the knot points can, without loss of generality, be chosen to be on the data points to make discrete adjustments to the knot locations.
To fit vector-valued functions, for example in multiclass classification, we would need to extend (9) to use vector-valued measures.This is the natural measure-theoretic analogue to the group-lasso.
In multivariate fitting problems with significant interactions between features generalized additive models may underfit.One possible solution is to use single-layer neural networks: i.e. learn functions of the form In the above, v i are constrained to lie in the unit ball.Unfortunately, the conditional gradient step for networks of this form is NP-hard [27].In many practical applications, however, we might expect that the degree of the interaction is bounded.That is, each v i has bounded cardinality.If we assume v i 0 ≤ 2, i.e. we only fit pairwise interactions, we can still apply the conditional gradient method.In this case, the fitting function is a sum of functions of pairs of the variables, formed from the basis elements ((cos θ)x p + (sin θ)x q − t) + , with (continuous) parameters θ and t and (index) parameters p and q (i.e.v = (cos θ)e p + (sin θ)e q ).(This is practical only if d is small enough.)Such functions capture nonlinear relationships between (pairs of) variables.
In this paper we propose a modification of the adaptive spline regression model -namely saturation constraints.We show that saturating splines inherit knot-selection from adaptive splines, and have a very important quality in the context of generalized additive models: feature selection.This allows saturating spline generalized additive models to remain interpretable.We also propose a simple, effective algorithm based on the standard conditional gradient method for solving the saturating spline estimation problem with arbitrary convex losses.Finally, we apply our algorithm to several datasets, demonstrating the simplicity of the resulting models.that s j (x) are as described (and s k (x) = −h k (x)).Now h(x) T w = h(x) T CC −1 w = s(x) T θ, with θ = C −1 w.However, in this case C −1 = C, and hence θ j = w j , j = 1, . . ., k − 1 and θ k = k j=1 w j .In this new basis, imposing the constraint amounts to setting θ k = 0, or simply deleting the last basis function.So fitting the linear model f subject to f (t k ) = 0 is equivalent to fitting the model g without constraints, and in fact the θ j = w j , j = 1, . . ., k − 1.
So in summary, fitting a constrained optimization with the hinge functions is equivalent to fitting an unconstrained optimization with the reduced set of saturating hinges.The remaining question is does this also work with the penalty λ w as in (26).Not quite, but close.It turns out we are still missing a penalty term λ| If we are willing to ignore this last penalty, we can fit the saturated spline model using a fast lasso solver, such as glmnet.By generating such a basis for each variable in a GAM, this same approach can be used to fit a saturated GAM regularization path.
The impact is that one could fit a saturated gam model by running say glmnet on the {s} bases.An example is given on the spam data in figure 11, where the regularization path was computed at a 100 values of λ in seconds.There are of course some caveats.
• If you use all the knots for each of p variables, you end up with a data matrix of dimension N × N p, which does not scale too well.So instead one might use a smaller grid of knots; e.g.map the variables onto [0, 1], and then use a grid of say 50 evenly spaced knots on this grid, including the end knots.
• With a large number of knots, the "variables" are highly correlated, and this can cause numerical issues.The main issue we see is that active sets tend to be larger than they should be.
Nevertheless, this is an alternative algorithm, which is closer in spirit to the adaptive splines algorithm.

C Proofs
Theorem 1. Fix x 1 , . . ., x n ∈ [0, 1] and f : R → R with f (right-continuous) of bounded total variation, and f constant outside of [0, 1].Then there exists a degree-one saturating spline f that matches f on x i with TV( f ) ≤ TV(f ).q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q −12 −10 −8 −6 −4 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 Proof.Without loss of generality, we will assume f (0) = 0. Let τ = TV(f ).As f has bounded total variation, there exists a measure µ on [0, 1] such that f (x) = f µ :

Log−lambda
That is, f is a spline with infinitely many knots.The idea is to use Caratheodory's theorem for convex hulls to see that, as we only care about µ in terms of its action on a finite number of functions (basically, we only care about the values of f at x i ), we can replace µ with a measure supported on finitely many points.

Figure 1 :
Figure1: f µ and f µ generated by the atomic measure µ (f µ ).The regularization functional, TV(f µ ), is the sum of the absolute values of the spikes in µ.Note that the (signed) sum of the spikes in µ is zero: that is, dµ = 0, which implies that f µ saturates.

Figure 6 :
Figure 6: Validation error for saturating spline generalized additive model fit to Spam dataset versus regularization parameter τ .

Figure 9 :
Figure9: The top two plots show conditional gradients for k = 2 with j = 0 and j = 1 respectively.The dashed lines denote the locations of the point masses: when j = 1, the conditional gradient consists of three point masses.The bottom plots show the corresponding measures.

Figure 11 :
Figure 11: Performance on the spam test data, with 20 knots per variable.Increasing to 50 did not make much difference.Minimum error is 0.047.